Use pin_memory in forward_batch.init_new to reduce decoding latency#21360

Open
litmei wants to merge 13 commits into sgl-project:main from litmei:decode_low_latency
Conversation

@litmei
Contributor

@litmei litmei commented Mar 25, 2026

Motivation

In low-latency scenarios, there are substantial idle gaps between Decode phases.

Modifications

Profiling identified a synchronous Host-to-Device (H2D) copy inside forward_batch.init_new as the bottleneck. Calling .pin_memory() on the host tensor before the transfer turns this synchronous copy into an asynchronous one.
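
As a rough illustration of the general pattern (not the code in this PR), the sketch below stages a CPU tensor in pinned memory before copying it to the device. The helper name `to_device` and the use of `non_blocking=True` are assumptions made for the example; the sketch falls back to a plain copy when CUDA is not available.

```python
import torch


def to_device(host_tensor: torch.Tensor, device: torch.device, pin: bool) -> torch.Tensor:
    """Copy a CPU tensor to `device`, optionally staging it in pinned memory.

    Hypothetical helper for illustration only; it is not part of sglang.
    """
    if pin and device.type == "cuda":
        # pin_memory() returns a copy of the tensor in page-locked host memory.
        pinned = host_tensor.pin_memory()
        # Assumption: non_blocking=True is used so the copy can be enqueued
        # on the current stream without making the host wait for it.
        return pinned.to(device, non_blocking=True)
    # Fallback: ordinary copy (also used when running on CPU only).
    return host_tensor.to(device)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
positions = torch.arange(8, dtype=torch.int64)
on_device = to_device(positions, device, pin=torch.cuda.is_available())

if device.type == "cuda":
    # Wait for the enqueued copy before reading the result on the host.
    torch.cuda.synchronize()
```

Pinning itself costs an extra host-side copy, so whether it helps depends on tensor size and how often the transfer happens; the Before/After profiles are the evidence for this particular code path.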

Accuracy Tests

Benchmarking and Profiling

Before:

image

After:

image

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@sglang-npu-bot
Collaborator

/tag-and-rerun-ci

@sglang-npu-bot
Collaborator

/rerun-failed-ci

@shadowxz109
Contributor

/rerun-failed-ci

@shadowxz109
Contributor

/rerun-failed-ci

Comment thread python/sglang/srt/model_executor/forward_batch_info.py
Comment on lines +815 to +827
if _pin:
    mrope_positions_cat = torch.cat(
        [pos for pos in mrope_positions_list],
        dim=1,
    ).pin_memory()
else:
    mrope_positions_cat = torch.cat(
        [pos for pos in mrope_positions_list],
        dim=1,
    )
self.mrope_positions = mrope_positions_cat.to(
    dtype=torch.int64, device=model_runner.device
)
Contributor Author

Could this part be simpler?
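
One possible simplification, offered only as a sketch: both branches build the same concatenation, so the `torch.cat` call can run once and the pinning can be applied conditionally afterwards. The standalone function `cat_mrope_positions` and its parameters are hypothetical stand-ins for the method body above; the `pin` flag plays the role of `_pin`.

```python
import torch


def cat_mrope_positions(mrope_positions_list, device, pin):
    """Concatenate position tensors along dim=1 and move them to `device`.

    Hypothetical standalone version of the snippet under review.
    """
    # Passing the list directly is equivalent to the identity comprehension
    # `[pos for pos in mrope_positions_list]` used in the diff.
    mrope_positions_cat = torch.cat(mrope_positions_list, dim=1)
    if pin:
        # Only the pinning step differs between the two original branches.
        mrope_positions_cat = mrope_positions_cat.pin_memory()
    return mrope_positions_cat.to(dtype=torch.int64, device=device)
```

This keeps the behavior of the two-branch version while removing the duplicated `torch.cat` call.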

Comment thread python/sglang/srt/model_executor/forward_batch_info.py